SCIENTIFIC 

REPORTS 




OPEN 



SUBJECT AREAS: 

MOLECULAR 
ENGINEERING IN PLANTS 

DNA RECOMBINATION 

PLANT MOLECULAR BIOLOGY 

NEXTGENERATION 
SEQUENCING 



Received 
28 May 2013 

Accepted 
1 6 September 201 3 

Published 
3 October 2013 



Correspondence and 
requests for materials 
should be addressed to 
D.Z. (zhangdb@sjtu. 

edu.cn) 

* These authors 
contributed equally to 
this work. 



Characterization of GM events by insert 
knowledge adapted re-sequencing 
approaches 

Litao Yang'*, Congmao Wang 1 *, Arne Hoist-Jensen 2 , Dany Morisset 3 , Yongjun Lin 4 & Dabing Zhang 1 

1 Collaborative Innovation center for biosafety of GMOs, National Center for Molecular Characterization of GMOs, School of Life 
Science and Biotechnology, Shanghai Jiao Tong University. 800 Dongchuan Road, Shanghai 200240. P. R. China, 2 Norwegian 
Veterinary Institute. P.O.Box 750 Sentrum, 01 06 Oslo, Norway, 3 Department of Biotechnology and Systems Biology, National 
Institute of Biology. Vecna pot 111, Sl-l 000 Ljubljana, Slovenia, 4 National Key Laboratory of Crop Genetic Improvement and 
National Centre of Plant Gene Research, Huazhong Agricultural University, Wuhan 430070, P. R. China. 

Detection methods and data from molecular characterization of genetically modified (GM) events are 
needed by stakeholders of public risk assessors and regulators. Generally, the molecular characteristics of 
GM events are incomprehensively revealed by current approaches and biased towards detecting 
transformation vector derived sequences. GM events are classified based on available knowledge of the 
sequences of vectors and inserts (insert knowledge). Herein we present three insert knowledge-adapted 
approaches for characterization GM events (TT51-1 and Tlc-19 rice as examples) based on paired-end 
re-sequencing with the advantages of comprehensiveness, accuracy, and automation. The comprehensive 
molecular characteristics of two rice events were revealed with additional unintended insertions comparing 
with the results from PCR and Southern blotting. Comprehensive transgene characterization of TT51-1 and 
Tlc-19 is shown to be independent of a priori knowledge of the insert and vector sequences employing the 
developed approaches. This provides an opportunity to identify and characterize also unknown GM events. 

It is internationally agreed that genetically modified (GM) crops could be commercialized after thorough safety 
assessment and only if they are deemed safe 1 . Molecular characterization of transgene inserts at the chro- 
mosome level including the insert sequence, its localization, the number of inserts and its flanking sequences is 
essential for the safety assessment and labeling of GMOs 2 . Furthermore, transgene insertion is frequently assoc- 
iated with intended and unintended changes at the genomic, transcriptomic, proteomic and metabolomics levels, 
which potentially affects food/feed quality and safety 3 . Therefore, molecular characterization data on the com- 
plete insert sequences and their localization are particularly important both for developers, risk assessors and 
regulators of GM crops. These data also serve as a basis for the development and validation of the specific 
detection methods for GMOs monitoring 4 . 

Current legally required and commonly applied approaches to obtain molecular characterization data are 
limited to Southern blot and polymerase chain reaction (PCR) analyses, combined with standard sequencing of 
the functional (intended) insert(s) and flanking genomic DNA of the host 2 . These approaches are time consuming 
and their abilities to detect DNA sequence motifs are potentially limited by various factors, such as the existence of 
substitutions, insertions, deletions in the sequence, and the quantities and/ or sizes of the targets. Thus their output 
information are often only sub-optimal albeit the optimal input efforts. For example, the developer initially 
documented only one inserted copy of the expression cassette of the 5-enolpyruvylshikimate 3-phosphate synthase 
(EPSPS) gene in the soybean event GTS40-3-2 (Roundup Ready, OECD unique identifier [UI] = MON-04032- 
6) that was approved for commercialization in 1994 5 . Later, the rearrangement of the 3' NOS terminator junction 
and one unintended 70-base pair (bp) DNA fragment insertion were evidenced in this event, and the molecular 
characterization of GTS40-3-2 has been amended three times 6 8 . With the constant expansion of GMO research 
and development, there is an increasing number of reports about fields or foods/feeds containing illegally/ 
unknown released GMOs. Examples include but are not limited to the StarLink maize (CBH351, UI = ACS- 
ZM004-3), the GM rice events LL601 (UI = BCS-OS003-7), Kemingdao and Kefeng 6, and the FP967 flax (UI 
= CDC-FL001-2) cases, which caused public concerns and international trade disruptions. In addition, author- 
ities are confronted with the very difficult task to detect unknown GMOs for which no information is available 10 . 
With the development of high throughput next generation sequencing (NGS) technology, complete genome 
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sequences can be obtained at high sequencing depth at reasonable 
costs 11 . NGS approaches have proven to be powerful tools for dis- 
covering gene fusion, re-arrangements, DNA insertion, and struc- 
tural variations in different animal and plant samples 1216 , although 
the massive data processing is the challenge. De novo assembly of a 
large eukaryote genome is presently beyond the scope of transgene 
characterization, but this challenge may be mitigated in the foresee- 
able future. Importantly, the majority of crop plants for which trans- 
genes are developed have already been extensively studied. More or 
less complete genome assemblies for these are or will soon be publicly 
available (NCBI Genome Resources, http://www.ncbi.nlm.nih.gov/ 
genome; Beijing Genomics Institute, Plant research, http://www.ldl. 
genomics.cn/page/pa-plant.jsp). Compared to conventional trans- 
gene characterization, whole genome re-sequencing, targeted bioin- 
formatics analyses and limited de novo assembly emerges as a much 
simpler and more effective approach to transgene characterization. 

Herein we present approaches to further exploit DNA re-sequen- 
cing and bioinformatics to comprehensively characterize the inserts 
of GMOs also when the a priori (pre-existing) knowledge of the DNA 
sequence(s) of vectors and inserts is limited or even absent. Pace is 
one of the most important crops in the world and a staple food for a 
large share of humanity. The complete sequence of the relatively 
small (389 Mb) rice genome was the first crop genome to be pub- 
lished 17 . Since our research interests are transgene characterization 
and detection, and rice genetics, we chose the two aforementioned 
transgenic rice events, TT51-1 and Tic- 19, as examples in this study. 
The TT51-1 event is the first food crop that was approved for com- 
mercialization in China in 2009 18,19 , and the Tic- 19 event is in the 
pipeline for approval in China 20 . We also included an in silico mimic 
to validate the software for detection of unknown transgenes. 

Results 

Three bioinformatics modules and the analytical program adapted 
to GMOs of different classes of pre-existing insert knowledge. Since 
the knowledge of a GM available a priori is case dependent, we 
designed three different bioinformatics modules for data analysis 
(Fig. 1). Each is targeted to be fit for a given hypothetical scenario 
(cf. insert sequence knowledge [ISK] classes 2-4 in the report of 
Hoist-Jensen et al.) 9 . Module 1 is intended for use when the DNA 
sequence of the transformation vector is known (ISK-class 2). Module 
2 is intended for use when a DNA sequence database of genetic 
elements and transgene constructs from known GMOs is available 
and can be used as a reference library (ISK-class 3). Module 3 is 
intended for use when no knowledge of the DNA sequence of the 
vector and insert is available a priori (ISK-class 4 = unknown GMO). 
For application of any of the three modules the species reference 
genome sequence must be available as a reference. 

Initially, the whole genomic DNA is isolated and subjected to 
standardized paired-end sequencing. The paired-end reads are then 
grouped according to their mapped affinity to known reference 
sequences (A to E; Fig. la). 

In module 1 (Fig. lb), four consecutive steps are included as 
follows: i) After processing the raw data (including filtering and 
adaptor trimming), all NGS reads are mapped back to the host gen- 
ome sequence to identify paired- end reads of type A or putatively to 
the types B, D, and E; ii) Then reads not classified as type A are 
mapped to the known transgenic vector sequence to assign reads 
to types B, C, D or E); iii) The transgene integration site(s), number 
of inserts, and flanking sequences are then determined by analysis of 
the type B, C, D and E reads; iv) Finally, the insert is verified using 
common PCR and Sanger sequencing analysis. 

In module 2 (Fig. lc), five steps are involved: i) Construction of a 
DNA transgene sequence library including frequently used exogen- 
ous genes, regulatory elements, marker genes, and vectors from dif- 
ferent open sources (publications, patents, and available databases); 
ii) Mapping all reads back to the transgene sequence library; iii) 



Individual de novo assembly of matched reads and their paired reads; 
iv) Uncovering the transgene sequences of the sequenced sample and 
drawing of the sketch map of DNA insert(s) on the basis of the 
assembled contigs; v) Experimental confirmation of the transgenic 
inserts by conventional PCR and Sanger sequencing. Notably, there 
is no separate mapping to identify reads of type A prior to the de novo 
assembly in step iii. This is to ensure comprehensive transgenic insert 
retrieval also in cases where cis-genic (host-derived) elements are 
inserted. If type A reads are removed from the set of reads to be 
assembled, this could improve the data processing time at the cost 
of cis-gene elements detection. 

In module 3 (Fig. Id), four steps are involved: i) Mapping of the 
reads against the reference genome to subtract type A reads; ii) direct 
de novo assembly of remaining reads; iii) BLAST analysis of all 
inferred contigs; iv) experimental verification of inserts by conven- 
tional PCR and Sanger sequencing. 

In order to simplify the workload of these developed modules in 
massive data analysis, we then developed one integrated (semi-) auto- 
mated program to implement the three modules based on the basic 
principles of Burrows- Wheeler Aligner (BWA) and Assembly by 
Short Sequences (ABySS) algorithms 21,22 . It circumvents the need 
for multiple runs of step-by-step DNA alignment and de novo analysis 
to identify matching read-pairs and assemble putative insert contigs. 
The developed program is downloadable at the following URL: http:// 
gmdd.shgmo.org/Computational-Biology/Transeq/Transeq.tar.bz2. 

The characterization of GM rice events Tic- 19 and TT51-1 employ- 
ing the three modules. In order to confirm the applicability of the 
developed modules for characterization of biotech events two 
presumed representative transgenic rice events, Tlc-19 and TT51-1, 
were selected as case examples for further analysis. The event Tlc-19 
was transformed with the cloning vector pBar-lC (Supplemental 
sequence No. 1) into the rice cultivar Minghui 63 by Agrobacterium- 
mediated transformation. The event TT51-1 was co-transformed with 
two cloning vectors pFHBTl (Supplemental sequence No. 2) and 
pGL2RC7 (Supplemental sequence No. 3) into Minghui 63 by 
particle bombardment. Paired-end sequencing (90 bp reads) yielded 
8.97 Gb and 9.92 Gb of raw sequence data corresponding to approxi- 
mately 23.8 X and 26.4 X sequencing depth from Tlc-19 and TT51-1 
rice, respectively. The raw sequence reads of Tlc-19 and TT51-1 rice 
are available in the Sequence Read Archive (SRA; http://www.ncbi. 
nlm.nih.gov/sra) with accession number SRA057974. The raw data 
was imported into the developed bioinformatics program for further 
analysis with the three different modules. 

Tlc-19 rice. Using the module 1 (Fig. lb), ~ 75.4 million (75.76%) 
reads were attributed to the type A, both ends properly mapped to the 
rice reference genome (TIGR 7.0), while 2,613 reads were type C, 
matching to the transgenic vector pBar-lC (Table 1), and 128 chi- 
meric pairs of reads were Type B, D and E, under the filter parameters 
allowing maximum edit distance (including insertions, deletions or 
substitutions) 10 bp for each single read (Supplemental Table SI). A 
total of 1 1 1 pairs of reads were perfectly matched to rice chromosome 
(Chr) 1 1 and pBar-1 C, and 10 pairs of reads were perfectly mapped to 
rice Chr 04 andp_Bar-iC. These 121 pairs of reads are compatible 
with presence of two transgene inserts located on Chr 04 and Chr 11, 
respectively. The flanking sequences of these two transgene inser- 
tions were also obtained. Two pairs of reads mapped to Chr 12 and 
were regarded as false positive due to the repetitive genome sequence 
between the Chr 12 and Chr 11. The left five pairs of scattered reads 
mapped to other chromosomes (Chr 01, Chr 10, and Chr 05) were 
observed and regarded as artefacts because of sequence and mapping 
specificity (Supplemental Table SI). Relaxing the stringency of the 
mapping parameters only slightly affected the number of chimeric 
read-pairs (Supplemental Table SI). In order to validate the above 
mentioned results, thirteen primer pairs were designed based on 
the obtained 128 paired reads (Supplemental Table S2). PCR was 
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Figure 1 | The three modules proposed for molecular characterization of transgenic lines using paired-end whole genome re-sequencing and data 
analysis. Module 1 (Fig. la-lb): the complete DNA sequence of the transformation vector is available (corresponding to insert sequence knowledge [ISK] 
class 2 scenarios) . The paired-end reads are characterized into five types (A to E). Type A: both paired-ends perfectly map back to the host genome. Type B: 
one end matches to the host genome, the other to the transgene. Type C: both paired-ends match to transgene. Types D and E: one end matches to host 
genome or transgene, and the other spans the junction region between host genome and transgene. Module 2 (Fig. lc): the DNA sequence of the 
transformation vector is not available but the transgene insert is expected to contain at least one genetic element that is included in a transgene element 
sequence library (database; corresponding to ISK-class 3 scenarios). Successful detection and characterization of the transgene depends on matches 
between the transgene and the sequence library. Module 3 (Fig. Id): no DNA sequence information on the transgene insert is available and a transgene 
element sequence library is expected to be of limited use (cf. ISK-class 4) . Successful detection and characterization depends on efficient de novo assembly 
and contig analyses. 
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Table 1 


Results of analysis of Tlc-1 9 and TT51 -1 rice events using module 1, including insert number estimation 
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performed with these primers employing the Tic- 19 rice and its 
isogenic control line Minghui 63 as templates, which led us to con- 
clude that except for the Chr 04 and Chr 1 1 matches, all matches were 
artefacts and did not represent additional inserts or transgene rear- 
rangements (Fig. 2a, b). Supportively, the long distance PCR amp- 
lification and Sanger sequencing results confirmed the above two 
transgene insertions. A 6394 bp and a 6392 bp T-DNA insertions 
were found to be inserted at 31,763,777 on Chr 04 and at 1,124,835 on 
Chr 11, respectively (Supplemental sequence No. 4 and No. 5). The 
sequence alignment analysis authenticated the two intact T-DNA 
insertions in the rice host genome (cf. Fig. 3a). To check for 
putative integration of transgene vector backbone sequence, a 
coverage depth plot of reads mapped against the pBar-lC vector 
was prepared (Supplemental Fig. SI). No indications of vector 
backbone sequence insertion were observed in Tic- 19 rice. The 
number of T-DNA inserts was also verified using equation 1 (see 
details in Materials and Methods section). The inferred insert 
detection index (X) was 1.02 (=1.0), indicating that all inserts 
derived from the transformation vector were retrieved (Table 1). 
Altogether, these results suggest the feasibility of the developed 
strategy (module 1 plus program) in the comprehensive molecular 
characterization of the GMO using re- sequencing. Under ISK- 
class 2 conditions 9 , we conclude that the input effort vs. output 
information ratio of the approach presented here is superior to the 
currently applied and legally required approaches including 
Southern blot and PCR analyses. 



In the analysis with module 2 (Fig. lc), the rice Tic- 19 was treated 
as a blind (unknown) sample. A transgene element library was con- 
structed containing the DNA sequence of 134 elements frequently 
used in GMOs (promoters, terminators, genes, cloning vector ele- 
ments) retrieved from published literature, patents, and databases 
(Supplemental Table S3). A total of 29,861 pairs of reads 
(Supplemental Table S4) mapped to 25 elements in the constructed 
library, indicating the presence of transgene(s), such as the maize 
Ubiquitin promoter, CaMV35S promoter, CaMV35S polyA termin- 
ator, NOS terminator, and the Bar gene. De novo assembly of the 
29,861 pairs reads yielded 22 contigs with length > 90 bp (Supple- 
mental Table S5). BLASTN analysis of these contigs against the 
GenBank database showed the presence of several genetic elements 
frequently used in transgene constructions (maize Ubiquitin pro- 
moter, CaMV35S promoter, CaMV35S PolyA terminator, NOS ter- 
minator, Lac Z, CrylC gene, and Bar gene), rice endogenous genes 
(Sucrose Phosphate Synthase gene, Acetyl-CoA carboxylase gene, and 
Actin promoter), and the insert junctions on Chr 1 1 (Supplemental 
Table S5). Notably, the CrylC gene was not included in the transgene 
element library but was also among the observed putative insert 
elements of assembled contigs. The insert-junctions on Chr 04 
detected in module 1 analysis were not identified in this approach. 
A sketch map of the transgene insert(s) was inferred (as shown in 
Fig. 3b) and successfully confirmed with PCR and Sanger sequencing 
(Supplemental sequences No. 4 and 5). The results confirmed the 
feasibility and practicability of module 2 for characterization and 
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Figure 2 | Deduced transgene loci andPCR confirmation ofthe insertions in GM rice Tlc-19. (a) transgene locus on Chr 04; (b) transgene locus on Chr 
11. 
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Figure 3 | Comparison of inferred transgene maps for the rice events Tlc-19 (Figs. 3a-3c and TT51-1 (Figs. 3d-3f), and for the mimic insert spiked 
in silico into the Tlc-19 rice event raw sequence data reads (Figs. 3g-3h). Top: the correct maps, obtained with module 1 for the two rice events 
(Figs. 3a and 3d), and spiked in silico (Fig. 3g). Middle: the results inferred with module 2 for the two rice events (Figs. 3b and 3e). Bottom: the results 
inferred with module 3 for the two rice events (Figs. 3c and 3f) and the spike (Fig. 3h). 

identification of GMOs referable to ISK-class 3 9 , although its success element library, and an update relative to the present version is 
depends, to some extent, on the constructed GMO library. The more available upon request. 

complete the transgene element library, the more powerful the appli- The module 3 pipeline (Fig. Id) was developed to deal with 
cation of module 2. We are continuously improving the transgene unknown GMOs that correspond to ISK-class 4 scenarios", in which 
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the information on the transgenic insertion is not available and the 
constructed GMO library cannot be used either. A total of 386,520 
contigs (> 100 bp in length) were de novo assembled from all non- 
type A reads using the developed semi- automated program. The 
number and the length of assembled contigs are plotted in 
Supplemental Fig. S2. BLASTN analysis of all contigs demonstrated 
that one 6,190 bp contig (Fig. 3c; Supplemental Sequence No. 6) had 
high similarity to the frequently used transgenic elements in trans- 
formation vectors (CaMV35S promoter and terminator, maize 
Ubiquitin promoter, and NOS terminator) and exogenous genes 
(CrylC, Bar and LacZ genes; as shown in Supplemental Fig. S3). 
We then compared the 6,190 bp contig with the known transgene T- 
DNA insert (cf. module 1 analysis above) and found that they were 
highly similar. This indicates that the module 3 pipeline is a very 
effective alternative for the molecular characterization of GMOs 
from ISK-class 4 scenarios except for the slightly large workload 
for blasting and filtering the assembled contigs (here 386,520). 

TT51-1 rice. Using the module 1 (Fig. lb), around 85.3 million 
(77.32%) reads were type A, and 2,428 reads were type C, matching 
to the cloning vectors pFHBTl &ndpGL2RC7 (Table 1). In addition, 
228 chimeric reads were Type B, D, or E, including 81 pairs on Chr 
03, 45 pairs on Chr 10, 38 pairs on Chr 05, 18 pairs on Chr sy, 13 pairs 
on Chr 04, 7 pairs on Chr 09, 6 pairs on Chr 08, 5 pairs on Chr 06, 5 
pairs on Chr 11,3 pairs on Chr 02, 3 pairs on Chr 12,2 pairs on Chr 
07, 1 pair on Chr 1, and 1 pair on Chr un (Supplemental Table S6). 
All these obtained paired reads were grouped based on the associated 
chromosome number, and the paired reads of each group were used 
to reveal the transgene insertion sites and flanking sequences indi- 
vidually. The results showed that two transgene insertions occurred 
in TT51-1 rice, one located at position 2,640,325 on Chr 04, and 
another located at position 5,697,885 on Chr 10. The 81 pairs of 
reads on Chr 03 were obtained because the rice endogenous Actin 
promoter is present in both the rice genome and the T-DNA inserted 
of the vector pFHBTl. Other scattered reads that were mapped to 
other chromosomes (Chr 01, Chr 02, Chr 06, Chr 07, Chr 08, Chr 09, 
Chr 11, Chr 12, Chr sy, and Chy un) were also observed and regarded 
as artefacts, after considering sequence and mapping specificity or 
the homogenous or repeat sequences among the 12 chromosomes in 
the rice genome (Supplemental Table S5). Relaxing the stringency of 
the mapping parameters only slightly affected the number of chi- 
meric read-pairs (Supplemental Table S5). In order to confirm the 
above results, twenty primer pairs were designed based on the 
obtained 228 pairs of reads (Supplemental Table S2). PCR was per- 
formed with these primers employing the TT51-1 rice and its iso- 
genic control line Minghui 63 as templates. The PCR and sequencing 
results indicated that except for the Chr 04 and Chr 10 matches, all 



matches were artefacts and did not represent additional insertions or 
transgene rearrangements (Fig. 4a, b). PCR amplification and Sanger 
sequencing results confirmed the above two transgene inserts and 
provided details on the integrated T-DNA sequences. On Chr 10, one 
long (9,818 bp) tandem CrylAb/c cassette derived mainly from 
pFHBTl was found to be integrated into the rice genome at position 
5,697,885. It included partial r Actin promoter, two tandem Cryl Ab/c 
cassettes, and partial hpt gene (Fig. 3d; Supplemental sequence No. 
7). This large insert was compatible with rearranged integration of 
one minor pGL2RC7 derived motif and two major and three minor 
partial pFHBTl derived contigs into one large insert. On Chr 04, one 
short transgene insert (599 bp) consisting of partial r Actin promoter 
and CrylAb/c gene derived from pFHBTl was found to be integrated 
at position 2,640,325 (Fig. 3d; Supplemental sequence No. 8). To get 
an overview of which parts of the cloning vectors that apparently had 
been integrated into TT51-1 rice, coverage plots of reads mapped 
against the pFHBTl and pGL2RC7 vectors were prepared Supple- 
mental Fig. S4 and S5). These plots indicated the presence of almost 
all vector elements, except the CaMV35S promoter and hpt gene 
from the pGL2RC7 vector. Residual backbone sequences (bla and 
hpt gene and multiple cloning sites) were located to the Chr 10 
insertion sequence (Supplemental sequence No. 7). The number 
of T-DNA inserts was verified using equation 1 (see Materials and 
methods) and the inferred insert detection index (X) was 1.027 
( — 1.0), indicating that all inserts derived from the transformation 
vectors were retrieved (Table 1). Altogether these results suggest that 
we have provided a full molecular characterization of TT51-1 rice, 
with significant improvements compared to previous reports on 
TT51-1 rice 18,23 . As described above for Tlc-19 rice, re-sequencing 
combined with the bioinformatics analysis with module 1 provides a 
novel approach for effective and comprehensive molecular char- 
acterization of GMOs. 

In the analysis with module 2 (Fig. lc), the rice TT5 1-1 was treated 
as a blind (unknown) sample. A total of 11,157 pairs of reads 
(Supplemental Table S7) mapped to 14 elements in the constructed 
transgene element library, indicating the putative presence of trans- 
gene^), such as the NOS terminator, bla gene, CrylAb/c gene, and 
hpt gene. De novo assembly of the altogether 11,157 pairs reads 
yielded 12 contigs with the length > 90 bp. BLASTN analysis of 
these contigs against the GenBank database showed the presence 
of several frequently used transgenes and elements (NOS terminator, 
CrylAb/c gene, bla gene, and hpt gene), rice endogenous genes 
(Sucrose Phosphate Synthase gene and Acetyl-CoA carboxylase gene), 
and two insert junctions on Chr 10 that were also retrieved with 
module 1 (Supplemental Table S8). The insert junctions on Chr 
04 (cf. module 1 analysis above) were not identified with this 
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Figure 4 | Deduced transgene loci and PCR confirmation ofthe insertions in GM rice TT51-1. (a) transgene locus on Chr 04; (b) transgene locus on Chr 
10. 
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approach. A sketch map of the transgene insertions was inferred (as 
shown in Fig. 3e) and successively confirmed with PCR and Sanger 
sequencing. As with Tic- 19 rice, the results confirmed the feasibility 
and practicability of re-sequencing combined with bioinformatics 
analysis with module 2 for characterization and identification of 
GMOs referable to ISK-class 3. 

For the module 3 pipeline (Fig. Id) the scenario is of ISK-class 4 
type (unknown GMO). From the set of non-type A reads, a total of 
212,515 contigs (> 100 bp in length) were obtained by de novo 
assembly. The number and the length of assembled contigs are plot- 
ted in Supplemental Fig. S6. BLASTN analysis of all contigs demon- 
strated that two contigs have high similarity to frequently used 
transgenic elements of transformation vectors. As was the case for 
Tlc-19 rice also TT51-1 rice therefore does not meet the module 3 
assumption that the transgene does not contain elements of the 
transgene element library for module 2. We will come back to this 
in the discussion below. One contig of 3,454 bp in length 
(Supplemental Sequence No. 9), included the transgene elements 
of CrylAb/c gene, NOS terminator, and bla gene flanked by or 
including the rice Actin promoter (Supplemental Fig. S7). The other 
contig of 1,478 bp (Supplemental Sequence No. 10) included a 
putative insert junction on Chr 10, corresponding to the 3' end of 
the insert detected with modules 1 and 2 (see above). Furthermore, 
the 1,478 bp contig included the commonly used cloning vector 
backbone sequences of the bla and hpt genes (Supplemental Fig. 
S8). Sketch maps of the transgene insertions were also drawn for 
the 3,454 and 1,478 bp contigs (as shown in Fig. 3f). According to 
the sketch maps and verified sequences of TT5 1-1 insert with the two 
used cloning vectors (pFHBTl and pGL2RC7), we confirmed that the 
rearrangement and truncation took place in the transformation pro- 
cess of particle bombardment 24 . As with Tlc-19 rice, the results also 
indicate that the module 3 pipeline is a very effective alternative for 
GMOs in the ISK-class 4 scenarios. 

The mimic transgene insertions from an artificial data set. To 

further confirm the ability of the developed module 3 to detect 
and characterize unknown transgene insertions, a third, artificial 
(simulated) dataset was created by in silico spiking of the Tlc-19 
dataset with mimicked reads corresponding to a 6424 bp contig 
containing one long (5,191 bp) and one short (89 bp) insert of 
"unknown" GM origin in the rice SPS gene (Fig. 3g; Supplemental 
Table S9; Supplemental Sequence No. 11). The "unknown" GM 
inserts of this mimic were almost entirely derived from soybean 
and the mimic did not contain any of the elements included in the 
constructed transgene element library or any cloning vector 
backbone elements. 

For the analysis of this data set no information on the hypothetical 
cloning vector used was available. Thus it would not be possible to 
use module 1 for the analysis. In a real life situation, it is not possible 
to exclude a priori the possibility that the data set includes one or 
more of the elements in the transgene library. Thus it is reasonable to 
use module 2 for the analysis. However, in this simulated case it was 
known a priori that module 2 would not yield true positives, and 
module 2 was not explored. 

With module 3, the spiked data set was input into the semi-auto- 
mated program, and a total of 387,518 assembled contigs (> 100 bp 
in size) were generated. Comparing with the 386,520 contigs from 
Tlc-19 rice data set (module 3 analysis above), more than 1,000 novel 
contigs were created. The BLASTN analysis of these new contigs 
revealed that two contigs of 5,200 bp and 238 bp, respectively 
(Supplemental Sequence No. 12-13), had high similarities to inser- 
tions into the rice SPS gene of DNA from soybean and other plant 
species. Successively, we compared the sketch maps of the two con- 
tigs with the mimicked insert. As illustrated in Fig. 3h, these contigs 
closely resembled the mimicked insertion. Detailed alignment of the 
5,200 bp contig with the simulated unknown insertion revealed that 



4,939 bp of the contig mapped to the 6,424 bp mimic with 99.96% 
identity, including the two junctions between the rice SPS sequence 
and the insert. However, a 159 bp deletion within this insert was 
incorrectly indicated in the 5,200 bp contig. Similar alignment of 
the 238 bp contig revealed that 89 bp of the contig completely 
mapped to the 89 bp mimic with 100% identity, and the two junc- 
tions between the rice SPS sequence and 89 bp mimic were also 
included. Combined with the two contigs of 5,200 bp and 238 bp, 
four short deletions were observed, i.e. the deletions of positions 1- 
418, 5358-5515, 5727-5902, and 6141-6424 of the mimic. However, 
these four deletions did not significantly affect the ability to char- 
acterize the transgene insertions. Concluding from the above results, 
the two mimic inserts were successfully retrieved with the pipeline of 
module 3 and the semi-automatic program. Notably one insert was 
as short as 89 bp. This should serve to demonstrate the power of the 
developed module 3 for the characterization of ISK-class 4 transgene 
inserts (unknown GMOs). 

Discussion 

The purpose of this study was to demonstrate the potential of high 
throughput sequencing for molecular characterization of GMOs 
under highly diverse a priori knowledge conditions. For this purpose 
three different bioinformatics modules were designed, each fit for a 
given insert-sequence knowledge class 9 . The results obtained from 
the combined analysis with all three modules on Tic- 19 rice indicate 
that this rice event contains only two, nearly identical, large complete 
inserts. The molecular characterization of these two inserts and their 
insertion loci in Tlc-19 rice are complete (module 1), nearly com- 
plete (module 2) or partial but sufficient to facilitate complete char- 
acterization (module 3). Notably, only one transgene insert 
(Supplemental sequence No. 4) was revealed for the event Tlc-19 
using state-of-the-art Southern blot and PCR techniques according 
to a published report 20 and cloned sequence (Genbank accession 
HQ161062.1). The results obtained with the combined use of all 
three modules on TT51-1 rice indicate that this rice event contains 
two inserts. The molecular characterization of the two inserts and 
their insertion loci in TT51-1 rice are complete (module 1), nearly 
complete (module 2) or partial but sufficient to facilitate complete 
characterization (module 3). One of the inserts in TT51-1 is short 
(599 bp) and unintended and has not been previously reported. The 
other is large (9818 bp) and rearranged. The partial sequence of the 
9818 bp insert has previously been reported 23 . However, significant 
errors in the previous description of the 5' -end of the large insert 
were observed in the present study. The corrected full-length insert 
map perfectly integrates two previously published and apparently 
unlinked accessions HQ161054 (5' flank/insert) and HQ161055 
(3' flank/insert). Obviously, the approaches reported here could be 
used to characterize GMOs at any ISK-levels. 

Module 1 proved to be extremely effective for the comprehensive 
detection and characterization of the transgene insertions (sequences 
of inserts, insertion loci and flanking sequences, and the number of 
inserts). It revealed one small insert (partial, truncated construct) 
and the complex rearranged long insert of TT51-1 rice. Module 1 
may be particularly useful for transgene developers wishing to obtain 
complete molecular data for further selection among related trans- 
gene events, for safety assessments, and for development of reliable 
event-specific detection methods. 

Module 2 provides a basis for rapid complete characterization of 
event specific inserts and flanking sequences. It has the potential to 
serve as an alternative GMO-screening approach, similar to but more 
advanced than the commonly applied so-called matrix approach 9 . 
Module 2 may be particularly useful to agricultural supply chain 
stakeholders in need of knowing if their products contain authorized 
or un-authorized GMOs. 

Module 3 lays a foundation for complete characterization of event 
specific insertions and flanking sequences of unknown GMOs, and 
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its suitability and validity has been demonstrated by the analyses of 
two GM rice events and mimic transgene insertion data. It is poten- 
tially of great value to public risk assessors and regulators for 
unknown GMOs inspection and monitoring. 

With the two investigated rice events Tic- 19 and TT51-1 the 
inserts contained several of the elements of our transgene element 
library. The contigs identified as the true inserts with the BLAST 
analyses were easily identified as transgenic due to their similarity 
to such elements. It could therefore be argued that these two rice 
events are unfit to assess the performance of module 3. We acknow- 
ledge this weakness associated with these two specific examples. 
However, the situation with the mimic transgene insertion was com- 
pletely different. The transgenic insert of the mimic was entirely 
composed of sequence elements which are absent in the transgene 
element library. This example therefore demonstrates that module 3 
is valid. 

For (semi-) automated analysis with module 3, several options will 
be explored and exploited to increase the success rate and efficiency 
of transgene insert detection/identification in future versions of the 
software. The contig size distributions suggest that the contigs of 
interest are usually among the longest contigs. With module 3 the 
a priori information on the transgene is absent or extremely limited. 
Complete insert detection and characterization may therefore be less 
important at this stage than is the case with GMOs that can be 
characterized with modules 1 or 2. Examining the longest contigs 
first and perhaps more carefully than the shorter contigs can there- 
fore be a reasonable approach, although it may result in failure to 
detect shorter transgenic inserts which cannot be represented by long 
contigs. Depending on the outcome of the BLAST analysis, there are 
alternative routes to transgene contig identification. Transgenes are 
by definition insertions of sequences from other taxa into the recipi- 
ent genome. Thus, transgene inserts are likely to appear as contigs 
showing substantial dissimilarity to the reference genome (including 
chloroplast and mitochondrial genomes). The (main) transgene 
insert of a GMO is usually transcriptionally functional and includes 
a coding element. The coding element in many GMOs is derived 
from cDNA and therefore lacks introns. Such a transgene would 
contain a relatively long open reading frame (ORF) compared to 
endogenous ORFs. Screening all the contigs for long ORFs, starting 
with the longest contigs could therefore contribute to faster iden- 
tification of the transgene contig(s). This approach would work inde- 
pendently of the result of BLAST analysis. It is, however, unfit for 
detection of genes containing introns. If the BLAST analysis returns 
annotated hits then another option emerges. The complete func- 
tional insert will normally include a promoter and usually also a 
terminator. This allows to study if a contig corresponds to a com- 
bination of promoter, gene and/or terminator elements. Notably and 
essentially by default, none of the elements found in the contig with 
the module 3 are included in the transgene element library of module 
2. Verification of the identity of transgene contigs after the bioinfor- 
matics analysis should in all cases be done experimentally, e.g. using 
conventional PCR and DNA sequencing. 

The re-sequencing and bioinformatics analyses outlined here are 
intended for samples derived from a single GM cultivar or species. 
The complexity of processing the combination of raw sequence data 
and necessary reference database accessions can be computationally 
prohibitive. This would particularly be true if a product containing 
ingredients of different species is analyzed. Intragenic inserts, if used, 
will be exceptionally difficult to detect without a priori knowledge of 
the DNA sequence of the transformation vector. We anticipate that 
increasing the sequence coverage can mitigate this challenge for 
modules 1 and partially also for module 2, but not for module 3. If 
a sufficiently large number of annotated reference cultivars are avail- 
able this can also to some degree improve the prospects for module 3. 

With module 1 the potential for detection and characterization of 
rearrangements of the recipient genome resulting from transformation 



can be maximized by re-sequencing the non-GM isogenic cultivar in 
paralleled. This is, however, not a prerequisite for successful compre- 
hensive detection and characterization of inserts and flanking 
sequences, as demonstrated in this study. 

Sequencing depth is positively correlated with the quality and 
quantity of data. The approaches outlined here can be further 
improved. Expansion of the transgene element library can improve 
the output of analyses with module 2, although for the two rice events 
examined in detail here the library was sufficiently complete to allow 
for complete insert characterizations. As discussed above, there is 
considerable potential for improvements of module 3 as well. The 
bioinformatics workload still represents a limitation to the routine 
use of modules 2 and 3 until the automatic and simplified software be 
developed in future. 

The two raw-data sets obtained from Tlc-19 and TT51-1 rice, 
respectively, are barely larger than the size of single data sets that 
can be obtained with the Illumina MiSeq sequencer. This fact is 
important because it suggests that it is possible to obtain the raw 
sequence data from a DNA sample within 1-2 working days with the 
new sequencer (Illumina MiSeq), as compared to 14 working days 
with the Illumina HiSeq sequencer used in the present study. Under 
optimal conditions we estimate that the total time required from 
receipt of sample material to presenting a detailed draft map of the 
insert (module 1) can be as little as three working days for a GMO 
with a small genome such as rice. Similarly, we estimate that without 
any a priori knowledge of the transformation vector or insert (mod- 
ule 3) a draft map of the insert excluding experimental verification 
can be obtained in as little as five working days with the help of 
automatic software in future. This would undoubtedly represent a 
significant improvement when time is critical, e.g. to manage risks 
and minimize damage from presence of unknown GMO in the sup- 
ply chain. 

Methods 

GM rice samples. The transgenic rice Tlc-19 was transformed with plasmid pBar-lC 
(Supplemental sequence No.l) carrying three trait genes (neomycin 
phosphotransferase II [nptll], CrylC and phosphinothricin acetyltransferase [Bar]). 
The latter two traits were transformed into the Oryza sativa indica cultivar Minghui 
63 by Agrobacterium-mediated transformation 20 . Tlc-19 exhibits high resistance to 
leaffolders (Cnaphalocrocis medinalis) and yellow stemborer {Tryporyza incertulas). 
The transgenic rice TT51-1 was co-transformed with two plasmids, pFHBTl 
(Supplemental sequence No.2) and pGL2RC7 (Supplemental sequence No.3), into 
Minghui 63 by particle bombardment. These plasmids carry the trait genes 
CrylAb(c), bla and hpt {pFHBTl) and rice chitinase 2 (Cht-2) and hpt (pGL2RC7), 
respectively. The bla and hpt genes are not transcriptionally active in plants. 
Transformation resulted in insertion of the CrylAb(c), bla and hpt trait genes and 
thus in high resistance against yellow stem borers and leaf-folders 18 . Both transgenic 
rice events and their isogenic control lines were developed and supplied by Huazhong 
Agricultural University, China. The GM rice samples were subjected to whole 
genome sequencing and further analysis of PCR amplification. The isogenic non-GM 
rice line (Minghui 63) was used as a control to validate the results from bioinformatics 
analysis. Homogeneous seeds of transgenic rice Tlc-19 and TT51-1 were planted in 
Wuhan (Hubei province, China) and harvested in September 2010. 

Paired-end sequencing. Total genomic DNA was extracted from rice seeds using the 
DNeasy Plant Mini Kit (Qiagen). The quality of the extracted rice genomic DNA was 
evaluated by UV-spectrophotometry with a Nanodrop ND-8000 instrument 
(Thermo Scientific) and 1% agarose gel electrophoresis. The sequencing DNA 
libraries were created using Illumina Paired-End library preparation kits and Illumina 
Oligos for adaptor ligation according to the manufacturer's instructions (Illumina). 
The mean fragment size was about 500 bp. The constructed DNA libraries were 
quantified by Pico-Green (Quant-iT; Invitrogen) and Agilent Bioanalyzer DNA 1000 
kit (Agilent Technologies). For both libraries 90 cycles of paired-end sequencing were 
performed on a single lane of an Illumina HiSeq 2000 DNA sequencer (Illumina). 
Paired-end sequencing and simple modification of the raw data to ensure sequencing 
quality were performed by BGI-Shenzhen (Shenzhen, China). Filtering was applied to 
remove reads of insufficient length, low read quality and to trim away adapter 
sequences before data analysis. 

In silico mimicking of unknown insertion event. To test the potential for detection 
of unknown inserts (no apriori knowledge of transformation vector or insert), a set of 
paired-end reads was generated in silico and spiked into the raw data obtained with 
Tlc-19 rice. First, a hypothetical unknown insertion event was designed by 
combining a part of contig 2 from the soybean event DP-305423-1 (patent 
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WO2008054747) with rearranged rice sucrose phosphate synthase gene (accession 
no. D45890). Contig 2 is mainly composed of soybean-derived elements. This 
hypothetical unknown event should mimic insertion and partial rearrangement of the 
insertion locus and insert. The detailed structure of the mimic is described in 
Supplemental sequence No. 11. The mimic has 6424 bp, with 1874 A, 1374 T, 
1213 C, and 1962 G. The simulated 90 bp paired-end reads were generated based on 
the mimic, corresponding to a sequencing depth of 23.8 X {Supplemental Table S9). 
The base error rate of each read was set to meet the uniformly decreasing distribution 
for an error rate of 0.1% at the start of the read increasing to 4% at the end of the read, 
and 5% probability to have a random DNA read, maximum 1 N allowed in a given 
read, the insertion size is 472 bp (mean distance between ends of read-pairs) with the 
standard deviation of 1 1 bp. The reference mimic was defined as to have a mutation 
rate of 0.001, the fraction of mutations that are indels was defined to 0.1, indel 
extension probability was set to 0.3, and the minimum length indel was set to 1 bp. 

Bioinformatics analysis and analytical program. Mapping of reads to the host 
genome (TIGR 7.0, including Chrl to Chrl2, Chloroplast, Mitochondrion, ChrSy, 
and ChrUn) or transformation vector (pBar-lc, pFHBTl, and pGL2RC7) was 
performed using the Burrows-Wheeler Aligner (BWA) algorithm (version 0.6.2) 21 . 
All the mapping processes were based on the single-end strategy. In module 1, 
parameter -n (i.e. maximum edit distance) was set to 10, 20, and 30 while keeping the 
other parameters default. In module 2 and 3, we used the default mapping parameters 
in BWA. The Assembly by Short Sequences (ABySS) algorithm version 1.3.4 22 was 
used in all assembly processes, parameter k (i.e. size of k-mer) was set to 23 bp and the 
other parameters were default. The analytical pipeline of module 1 to module 3 
analyses is implemented in Bash script under the Ubuntu X86_64 system. The source 
code and its executable file are available from the URL: http://gmdd.shgmo.org/ 
Computational-Biology /Transeq/Transeq.tar.bz2). 

Transgene element sequence library. A collection of altogether 134 sequence 
elements, mainly derived from transgenic plants and transformation vectors, were 
compiled from NCBI Genbank and GMDD databases 25 and served as a transgene 
element sequence library (Supplemental Table S3). 

Experimental validation of transgenic inserts. The transgene location, flanking 
region sequences and the inserts of transgenic rice Tlc-19 and TT51-1 were 
confirmed with conventional PCR and DNA sequencing. PCR amplifications were 
performed with the primers listed in Supplemental Table S2 and the PCR 
Amplification Kit (Takara Biotechnology Co., Ltd.) with the following cycling profile: 
5 min at 94 U C, 35 cycles of (30 sec denaturation at 94°C, 30 sec annealing at 52°C, 
and 1 min extension at 72 °C), followed by a 7 min additional extension step at 72°C. 
PCR products were separated by 2% agarose gel electrophoresis to visually inspect 
amplification products. Each product was purified from the gel using AxyPrep™ 
DNA Gel Extraction Kit ( Axygen Biosciences) and successively cycle sequenced using 
BigDye Terminators Version 3.1 kit (Applied Biosystems) and the relevant primers 
(Supplemental Table S2) on an ABI 3730 sequencer (Invitrogen). 

Verification of the number of inserts. Independent verification of the number of 
inserts based on module 1 data was obtained with the equation (Equation 1): 

X — , where X = insert detection index, N = number of reads 

(L v i + L v2 ...Lvm) x D x R 
mapped to the plasmid vector, L r = read length, L vl = length of insert no.l, L v2 = 
length of insert no. 2, etc., D = sequencing depth, and R — relative ratio of reads 
mapped to the host genome (detailed calculations described in Table 1). 
Theoretically, X — 1 if all inserts are detected and the sequencing data and analyses 
are perfect. 
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